autodb: splitting up tables automatically

Mark Webster

2025-07-03

ChickWeight (\(578 \times 4\))

Time Chick weight Diet
0 1 42 1
0 2 40 1
0 3 43 1
0 4 42 1
2 1 51 1
2 2 49 1
2 3 39 1
2 4 49 1

ChickWeight (\(578 \times 3\), \(50 \times 2\))

Time Chick weight
0 1 42
0 2 40
0 3 43
0 4 42
2 1 51
2 2 49
2 3 39
2 4 49
Chick Diet
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1

Database normalisation

1NF 1NF 2NF 2NF 1NF->2NF 3NF 3NF 2NF->3NF BCNF BCNF 3NF->BCNF 4NF 4NF BCNF->4NF 5NF 5NF 4NF->5NF 6NF 6NF 5NF->6NF

Tidy data

In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This is Codd’s 3rd normal form [i.e. database normalisation], but with the constraints framed in statistical language…”

~ Wickham, Tidy data (2014)

What if we don’t know the structure?

No expected structure, or the data’s actual structure is different:

  • No documentation
  • Faulty documentation
  • Data entry errors

autodb

autodb turns a table into a database:

chick_db <- autodb(ChickWeight)
database with 2 relations
4 attributes: weight, Time, Chick, Diet
relation Chick: Chick, Diet; 50 records
  key 1: Chick
relation Time_Chick: Time, Chick, weight; 578 records
  key 1: Time, Chick
references:
Time_Chick.{Chick} -> Chick.{Chick}


Quick enough for interactive use for small data sets:

Unit: milliseconds
     min      lq     mean  median      uq     max neval
 15.9692 18.6487 19.74904 19.4545 20.8726 27.5227   100

Plotting with GraphViz

chick_code <- gv(chick_db)
DiagrammeR::grViz(chick_code)

Chick Chick (50 records) Chick ordered Diet factor Time_Chick Time_Chick (578 records) Time numeric Chick ordered weight numeric Time_Chick:FROM_chick->Chick:TO_chick

What we don’t see

Chick Chick (50 records) Chick ordered Diet factor Time_Chick Time_Chick (578 records) Time numeric Chick ordered weight numeric Time_Chick:FROM_chick->Chick:TO_chick

What we don’t see

constants constants (1 record) Diet factor Time_Chick Time_Chick (578 records) Time numeric Chick ordered weight numeric

Finding structure-breaking errors

title title (213 records) title character publication_id integer reference character year integer

How it works: FDs

Functional dependencies (FDs):

\(X \rightarrow y\) (\(X\) determines \(y\))

For any pair of rows, if their values in variables \(X\) are equal, then their values in variable \(y\) are equal

\(y := f(X)\) for some (unknown) \(f\)

How it works: FDs to schema

Bernstein synthesis, 1976

Phillip Bernstein

Synthesizing third normal form relations from functional dependencies. ACM Trans. Database Syst.

How it works: FD search

FDHits algorithm, 2024

Tobias Bleifuß

Thorsten Papenbrock

Thomas Bläsius

Martin Schirneck

Felix Naumann

Discovering functional dependencies through hitting set enumeration. Proc. ACM Manag. Data

What did I do?

Implementation in R (and lots of tests)

Diagrams (esp. for multiple keys)

Some search customisation (e.g. limiting what can be in \(X\))

Databases are vector-like

names(chick_db) <- c("Chick", "Measurement")

Chick Chick (50 records) Chick ordered Diet factor Measurement Measurement (578 records) Time numeric Chick ordered weight numeric Measurement:FROM_chick->Chick:TO_chick

Databases are vector-like

c(chick_db, db_simple)

Chick Chick (50 records) Chick ordered Diet factor Measurement Measurement (578 records) Time numeric Chick ordered weight numeric Measurement:FROM_chick->Chick:TO_chick title title (213 records) title character publication_id integer reference character year integer

Databases are vector-like

chick_db[c(1, 1, 2, 2, 2)]

Chick Chick (50 records) Chick ordered Diet factor Chick_1 Chick.1 (50 records) Chick ordered Diet factor Measurement Measurement (578 records) Time numeric Chick ordered weight numeric Measurement:FROM_chick->Chick:TO_chick Measurement:FROM_chick->Chick_1:TO_chick Measurement_1 Measurement.1 (578 records) Time numeric Chick ordered weight numeric Measurement_1:FROM_chick->Chick:TO_chick Measurement_1:FROM_chick->Chick_1:TO_chick Measurement_2 Measurement.2 (578 records) Time numeric Chick ordered weight numeric Measurement_2:FROM_chick->Chick:TO_chick Measurement_2:FROM_chick->Chick_1:TO_chick

Databases are vector-like

x <- db_simple
x$title <- chick_db$Chick

title title (50 records) Chick ordered Diet factor

Databases are vector-like

data.frame(
  FD = discover(ChickWeight),
  relvar = autodb(ChickWeight)
)
FD relvar
Chick -> Diet relation Chick (50 records)
Time, Chick -> weight relation Time_Chick (578 records)

No package dependencies

No package dependencies

No package dependencies

No package dependencies

The end

autodb is on CRAN

Future plans: better performance, normalising to BCNF, splitting to remove missing values

References:

  • Bernstein P. A. (1976) Synthesizing third normal form relations from functional dependencies
  • Bleifuss et al. (2024) Discovering functional dependencies through hitting set enumeration
  • Other data profiling research by the Information Systems Group at Hasso-Plattner-Institut

FAQ

Speed

time (s) DF to...
data nrow ncol #FDs FDs DB
ChickWeight 578 4 2 0.01 0.02
planes 3322 9 24 0.16 0.52
nudge 447 25 3732 3.05 18.25
liquor 1020561 24 1637 1274.50 3188.39

NAs: the standard-practice cheat

patient entry_date exit_date
1 2022-05-02 NA
2 2022-06-06 NA
3 2022-04-01 2022-10-07
4 2022-03-19 NA

Equality to NA is NA, not true or false.

This breaks functional dependencies.

Standard fix: pretend they’re non-missing and equal.

NAs: the standard-practice cheat

patient entry_date exit_date
1 2022-05-02 NA
2 2022-06-06 NA
3 2022-04-01 2022-10-07
4 2022-03-19 NA

patient patient (4 records) patient integer entry_date Date exit_date Date

NAs: the ideal (on the to-do list)

patient patient (4 records) patient integer entry_date Date patient_exit patient_exit (1 record) patient integer exit_date Date patient_exit:FROM_patient->patient:TO_patient

BCNF

2 functional dependencies
3 attributes: a, b, c
a, b -> c
   c -> a

c c c a a_b a_b a b c a_b:FROM_c->c:TO_c

BCNF

2 functional dependencies
3 attributes: a, b, c
a, b -> c
   c -> a

c c c a b_c b_c b c b_c:FROM_c->c:TO_c

BCNF

2 functional dependencies
3 attributes: a, b, c
a, b -> c
   c -> a

c c c a b_c b_c b c b_c:FROM_c->c:TO_c view view a b c view:TITLE->b_c:TITLE

Python: alteryx/autonormalize